03-Data Visualization

Today we’ll be working with a dataset that I downloaded from the World Bank’s DataBank. The raw dataset from the DataBank is not amenable for direct analysis, so we will be working with a processed version. For those who are interested, you can find the raw dataset here and the script I used to process the data here.

First, let’s load the ggplot2 package. (Install the package with install.packages("ggplot2") if you have not done so yet.)

library(ggplot2)

Next, run the following line of code to get the dataset as a data frame in R. Don’t worry about the syntax; treat this as a magical incantation for now.

df <- read.csv("http://web.stanford.edu/~kjytay/courses/stats32-aut2019/Session%203/worldbank_data_tidy.csv",
               stringsAsFactors = FALSE)

We use the str(), head() and View() functions to see the dataset:

str(df)

## 'data.frame':    2170 obs. of  11 variables:
##  $ cty_name  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ cty_code  : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year      : int  2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 ...
##  $ elecAccess: num  45.2 42.7 43.2 69.1 70.2 ...
##  $ gdpPerCap : num  1455 1637 1627 1807 1875 ...
##  $ compEduc  : int  9 9 9 9 9 9 9 9 9 NA ...
##  $ educPri   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ educTer   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ govEducExp: num  NA 3.46 3.44 2.52 3.43 ...
##  $ popYoung  : num  47.9 47.8 47.3 46.7 46 ...
##  $ pop       : int  28394813 29185507 30117413 31161376 32269589 33370794 34413603 35383128 36296400 37172386 ...

head(df)

##      cty_name cty_code year elecAccess gdpPerCap compEduc educPri educTer
## 1 Afghanistan      AFG 2009   45.23712  1454.663        9      NA      NA
## 2 Afghanistan      AFG 2010   42.70000  1637.378        9      NA      NA
## 3 Afghanistan      AFG 2011   43.22202  1626.765        9      NA      NA
## 4 Afghanistan      AFG 2012   69.10000  1806.764        9      NA      NA
## 5 Afghanistan      AFG 2013   70.15348  1874.766        9      NA      NA
## 6 Afghanistan      AFG 2014   89.50000  1897.526        9      NA      NA
##   govEducExp popYoung      pop
## 1         NA 47.90947 28394813
## 2    3.46196 47.79811 29185507
## 3    3.43785 47.34328 30117413
## 4    2.52441 46.73910 31161376
## 5    3.43437 46.03378 32269589
## 6    3.67390 45.28739 33370794

Here is a short summary of the variables in the dataset (other than the obvious ones):

elecAccess: % of population with access to electricity.
gdpPerCap: GDP per capita based on purchasing power parity. This is a proxy for how rich a country is.
compEduc: Number of years that children are legally obliged to attend school.
educPri: % of population aged 25 and over that attained or completed primary education.
educTer: % of population aged 25 and over that attained or completed Bachelor’s or equivalent.
govEducExp: General government expenditure on education as a % of GDP.
popYoung: Population between the ages of 0 to 14 as a % of total population.
pop: Total population.

For more details, here is the metadata file that the DataBank produced along with the raw dataset.

Try out some of the other summary functions we learnt last week. How many possible values does compEduc take on? What years does this dataset span?

Histograms

Histograms are a good way of understanding the distribution of a single continuous variable. In this dataset, we could be interested in the distribution of GDP per capita one variable of interest that is probably of greatest interest is price. Let’s plot a histogram of price to understand its distribution:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = gdpPerCap))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 250 rows containing non-finite values (stat_bin).

Notice that we received a warning message: “Removed 250 rows containing non-finite values (stat_bin).”. There happened to be 250 rows that had NA as the value in the gdpPerCap column; R is warning us about it. As a data analyst, it would be a good idea to find out why these rows did not have any data.

If we don’t specify data or mapping inside the geom_histogram() function, it will look for these values in the ggplot() function. Hence, the code below will give the same result:

ggplot(data = df, mapping = aes(x = gdpPerCap)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 250 rows containing non-finite values (stat_bin).

R defaults to 30 bins for the histogram. We can change this by adding a bins argument to geom_histogram(). As you can see from the histograms below, different bin widths can give very different interpretations of the data!

ggplot(data = df) +
    geom_histogram(mapping = aes(x = gdpPerCap), bins = 2)

## Warning: Removed 250 rows containing non-finite values (stat_bin).

ggplot(data = df) +
    geom_histogram(mapping = aes(x = gdpPerCap), bins = 100)

## Warning: Removed 250 rows containing non-finite values (stat_bin).

ggplot(data = df) +
    geom_histogram(mapping = aes(x = gdpPerCap), bins = 1000)

## Warning: Removed 250 rows containing non-finite values (stat_bin).

Scatterplots

Scatterplots are good for visualizing the relationship between 2 continuous variables. For example, what’s the relationship between primary education attainment and GDP per capita? Let’s make a scatterplot of GDP per capita vs. primary education attainment:

ggplot(data = df) +
    geom_point(mapping = aes(x = educPri, y = gdpPerCap))

## Warning: Removed 1720 rows containing missing values (geom_point).

Wow, what a mess! That’s because we have so many data points being plotted over each other (this is called overplotting). Are there more points in the 70-80% range on the x-axis, or in the 95-100% range? It’s hard to tell. One way to address this is to modify the transparency of each point by adjusting “alpha”. By default, alpha = 1, which represents being fully opaque. We can reduce alpha (alpha = 0.2 means that 5 points are needed to get full opacity). Notice that alpha is not within the aes brackets since it is not determined by the data.

ggplot(data = df) +
    geom_point(mapping = aes(x = educPri, y = gdpPerCap), alpha = 0.2)

## Warning: Removed 1720 rows containing missing values (geom_point).

While alpha = 0.2 is probably too faint in this case, the characteristics of the data become more obvious.

We see that the scale on the y-axis is somewhat dominated by the points at the top. One way to make the points more spread out across the canvas is to perform a logarithmic transformation on the y-axis:

ggplot(data = df) +
    geom_point(mapping = aes(x = educPri, y = log10(gdpPerCap)), alpha = 0.5)

## Warning: Removed 1720 rows containing missing values (geom_point).

There certainly seems to be a positive relationship here, although the range in GDP per capita seems fairly wide when primary education attainment is very high.

Does the relationship change of we look at tertiary education attainment instead? We make this scatterplot below. Instead of filled circles, we could change the shape of the points manually through the shape argument (see this reference for which symbols correspond to each shape value):

ggplot(data = df) +
    geom_point(mapping = aes(x = educTer, y = log10(gdpPerCap)), shape = 4)

## Warning: Removed 1902 rows containing missing values (geom_point).

Let’s explore another relationship: years of compulsory education vs. GDP per capita. Let’s start with a scatterplot:

ggplot(data = df) +
    geom_point(mapping = aes(x = compEduc, y = log10(gdpPerCap)))

## Warning: Removed 532 rows containing missing values (geom_point).

This is not very informative. We see a lot of overplotting going on: and that’s because compEduc only takes on a few values. (Because of this, compEduc can also be thought of as a categorical variable!) Let’s use the alpha trick that we used previously:

ggplot(data = df) +
    geom_point(mapping = aes(x = compEduc, y = log10(gdpPerCap)), alpha = 0.2)

## Warning: Removed 532 rows containing missing values (geom_point).

In this case, changing alpha on its own is not going to help much, since all the points are still going to lie on one vertical line. Let’s add jitter (i.e. move the points by a small random amount) to get a better view.

ggplot(data = df) +
    geom_point(mapping = aes(x = compEduc, y = log10(gdpPerCap)), 
               alpha = 0.2, position = "jitter")

## Warning: Removed 532 rows containing missing values (geom_point).

Because jittering is such a common operation, instead of adding position = "jitter" as an argument to geom_point(), we can use geom_jitter() directly to get the same plot:

ggplot(data = df) +
    geom_jitter(mapping = aes(x = compEduc, y = log10(gdpPerCap)), 
               alpha = 0.2)

## Warning: Removed 532 rows containing missing values (geom_point).

Boxplots and violin plots

Because compEduc only takes on a few values, we can think of it as a categorical variable. In the plots above, instead of plotting the individual points for each observation, we could draw a boxplot for each value of the categorical variable.

Because compEduc is a numeric variable, we need to pass factor(compEduc) to the geom_boxplot() function instead so that it treats compEduc like a categorical variable:

ggplot(data = df) +
    geom_boxplot(mapping = aes(x = factor(compEduc), y = log10(gdpPerCap)))

## Warning: Removed 250 rows containing non-finite values (stat_boxplot).

For a violin plot, use geom_violin():

ggplot(data = df) +
    geom_violin(mapping = aes(x = factor(compEduc), y = log10(gdpPerCap)))

## Warning: Removed 250 rows containing non-finite values (stat_ydensity).

Line plots

Let’s say we want to look at how population changes over the time period that we have in our dataset. A natural first try would be the following:

ggplot(data = df) +
    geom_line(aes(x = year, y = log10(pop)))

What’s going on here?? What’s happening is that for each value of year, there are several rows associated with that value: one for each country! What we really want is one line per country. The way to achieve that is by specifying a group option:

ggplot(data = df) +
    geom_line(aes(x = year, y = log10(pop), group = cty_code), alpha = 0.2)

## Warning: Removed 7 rows containing missing values (geom_path).

The plot shows that population doesn’t change very much on the log scale, as one might expect. Do we see the same trend with the percentage of the population between 0-14 years?

ggplot(data = df) +
    geom_line(aes(x = year, y = popYoung, group = cty_code), alpha = 0.2)

## Warning: Removed 237 rows containing missing values (geom_path).

There seems to be a slight downward trend.

These plots are not that useful because we are showing data for all the countries on a single plot: it ends up making the plot look like a mess. It might be better to just look at the trends for a handful of countries of interest.

Relationships between 3 or more variables

Is there a relationship between the total population of a country and the relative proportion of the country being < 15 years old? Let’s make a scatterplot to find out:

ggplot(data = df) +
    geom_point(aes(x = log10(pop), y = popYoung))

## Warning: Removed 237 rows containing missing values (geom_point).

That’s an interesting looking plot! We see very clear streaks of points, suggesting that they belong to the same group. One way to test this is to color code the points by the country they belong to:

ggplot(data = df) +
    geom_point(aes(x = log10(pop), y = popYoung), col = cty_code)

## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomPoint, : object 'cty_code' not found

What went wrong here? Because color depends on data, we should put it within the aes() brackets. If col is outside the brackets, R interprets our intent as changing the color of all points to cty_code. But the variable cty_code does not exist in our global environment, hence the error. This is the correct syntax:

ggplot(data = df) +
    geom_point(aes(x = log10(pop), y = popYoung, col = cty_code))

## Warning: Removed 237 rows containing missing values (geom_point).

The legend ends up taking up the whole place of the plot… The code below can be used to remove the legend (we’ll learn more about themes next class):

ggplot(data = df) +
    geom_point(aes(x = log10(pop), y = popYoung, col = cty_code)) +
    theme(legend.position = "none")

## Warning: Removed 237 rows containing missing values (geom_point).

Let’s color the dots by year to see if there is a trend within each streak of dots:

ggplot(data = df) +
    geom_point(aes(x = log10(pop), y = popYoung, col = year))

## Warning: Removed 237 rows containing missing values (geom_point).

Saving a plot

To save a plot, click on the button, and click “Save as Image…” You can adjust the size of your image in the pop-up before saving it.

Session info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       knitr_1.24       magrittr_1.5     tidyselect_0.2.5
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.0         rlang_0.4.0     
##  [9] stringr_1.4.0    dplyr_0.8.3      tools_3.6.1      grid_3.6.1      
## [13] gtable_0.3.0     xfun_0.9         withr_2.1.2      htmltools_0.3.6 
## [17] yaml_2.2.0       lazyeval_0.2.2   digest_0.6.20    assertthat_0.2.1
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.2      glue_1.3.1      
## [25] evaluate_0.14    rmarkdown_1.15   labeling_0.3     stringi_1.4.3   
## [29] compiler_3.6.1   pillar_1.4.2     scales_1.0.0     pkgconfig_2.0.2